The purpose of this project is to create a machine learning model that can predict the price of a diamond based on its characteristics. This report starts with an initial exploratory data analysis (EDA). The EDA is filled with data visualizations and initial Q&As to help understand the data set. The EDA is then followed by a model exploration, in which different machine learning algorithms are used to create predictive models. The models are then compared against each other and the most accurate predictive model is selected. The conclusion section provides an executive summary of the data insights and the selected predictive model.
This initial exploratory analysis will attempt to answer the following questions:
First Look: What does the data look like? What are some initial descriptive statistics and insights?
Best of the Best: Imagine someone wants to buy the best possible diamond, and money is no object. They only want to consider diamonds in the top categories of cut (Ideal), color (D), and clarity (IF). They want the most ideal range for depth (59-63) and table (54-57). Within the dataset, if we plot carat versus price, can we fit a clean trendline? Is it linear? Exponential? What’s the price of the largest carat, and is it the most expensive?
Depth and Table Percentages: I found the ideal depth and table values mentioned above online, but let’s explore the dataset a little. If we fix the 4 C’s (carat, cut, color, and clarity), how much do depth and table impact price? If we widen the ranges slightly, can we save a substantial amount?
Best Bang for the Buck: Imagine someone wants to find the diamond which maximizes cut, color, and clarity per dollar. Using the expanded depth and table values from question 2 above, when does price start to increase exponentially for cut? What about for color? And clarity?
Bigger is Better: Imagine a guy named Bob who wants to buy a pair of diamonds for his wife and have them made into earrings for her birthday. In Bob’s mind, size (carat) is all that matters. He has $3200. He needs two diamonds with the exact same cut, color, and clarity (with very comparable depth and table values), and he wants them to be as big as possible. What size carat can he afford? If he adjusts his budget, how much does the “maximum carat” size shift? Can we plot that and fit a line to it to find the “knee in the curve”?
This data set is provided in the ggplot2 package from CRAN. Let’s load the tidyverse collection of packages, which includes ggplot2 and some other packages (like dplyr) which are useful for data manipulation. Let’s also load the plotly package, which allows us to create advanced visualizations.
# Load libraries needed for EDA
library(tidyverse)
library(plotly)
# save diamonds data into an object called data
data <- ggplot2::diamonds
A sample of the data is provided below:
head(data)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Looks like there is a mix of numeric and categorical data. The numeric columns (carat, depth, table, x, y, & z) are doubles (dbl), which means they are numerics with decimal places. Price is an integer (int), so it will not have any decimal places. The categorical columns (cut, color, and clarity) are ordered factors (ord), which means they have a specific order, such as best to worst. Similar to the sample above, the glimpse function provides a sample of the data in a different format that is preferable when a data set is very large.
glimpse(data)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
There are 53,940 rows of data; each row is one observation (one diamond). There are 10 columns of data; each column is a variable/feature. Below is a quick statistical summary of each feature:
summary(data)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
A minimum, 1st quartile, median, mean, 3rd quartile, and maximum are provided for all the numeric data. For example, the median price of diamonds in this data set is $2,401. A count is provided for all categorical data. For example, there are 21,551 Ideal-cut diamonds in this data set. Since all the categorical data is ordered, the counts appear in their pre-specified order. For example, the order for color is: D, E, F, G, H, I, J. Below is a definition of each feature:
carat: weight of the diamond (0.2 to 5.01)
cut: quality of the cut (in order of worst to best; Fair, Good, Very Good, Premium, Ideal)
color: diamond color (in order of best to worst; D, E, F, G, H, I, J)
clarity: a measurement of how clear the diamond is (in order of worst to best; I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)
depth: total depth percentage (total height divided by total width) = 2 * z / (x + y) (43 to 79)
table: width of the top of a diamond relative to its widest point (43 to 95)
price: price in US dollars (326 to 18823)
x: length in mm (0 to 10.74)
y: width in mm (0 to 58.9)
z: depth in mm (0 to 31.8)
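The orderings stated above can be confirmed directly from the factor levels:

```r
# Check the factor ordering encoded in the data set
levels(ggplot2::diamonds$cut)      # Fair < Good < Very Good < Premium < Ideal (worst to best)
levels(ggplot2::diamonds$color)    # D < E < ... < J (best to worst)
levels(ggplot2::diamonds$clarity)  # I1 < SI2 < ... < IF (worst to best)
```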
My OCD is kicking in a little bit for color. It is in order of best to worst, while cut and clarity are in order of worst to best. The script below reorders color from worst to best for consistency. This will have no effect whatsoever on the analysis; it’s just my personal preference.
# Reordering color so it's from worst to best. Just like all the other ordered features.
# This is personal preference. It's not a necessary step.
data <- data %>% # Create a new object, data, from the old object, data
mutate( # Change its columns
color = ordered(color, # Make the color column an ordered factor
levels = c( # The order/levels are as follows:
"J",
"I",
"H",
"G",
"F",
"E",
"D"
)))
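As an aside, the forcats package (also part of the tidyverse) can do the same reversal in one call; a minimal sketch:

```r
library(forcats)  # part of the tidyverse

# fct_rev() reverses the existing level order: D..J becomes J..D
color_rev <- fct_rev(ggplot2::diamonds$color)
levels(color_rev)  # "J" "I" "H" "G" "F" "E" "D"
```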
Now we can get to the interesting stuff. Let’s look at how price is affected by each individual feature.
Below is a scatter plot of price vs carat:
ggplot(data, aes(x=carat, y=price)) + # Plot the object data, with carat mapped to the x axis and price mapped to the y axis
geom_point(alpha=0.1) # Make the plot a scatter plot. Make each point 10% transparent/alpha
Price increases at an increasing rate as carat increases.
Below is a box-plot of price by cut.
{ggplot(data, aes(x=cut, y=price)) + # Plot the object data, with cut mapped to the x axis and price mapped to the y axis
geom_boxplot()} %>% # Make the plot a box plot
ggplotly() # Make the plot interactive with the ggplotly function
Below is a box-plot of price by color.
{ggplot(data, aes(x=color, y=price))+
geom_boxplot()} %>%
ggplotly()
Below is a box-plot of price by clarity.
{ggplot(data, aes(x=clarity, y=price))+
geom_boxplot()} %>%
ggplotly()
Below is a scatter plot of price vs depth percentage.
ggplot(data, aes(x=depth, y=price))+
geom_point(alpha=0.1)
It looks like the more expensive diamonds cluster in a narrower depth range, roughly 58% to 63% depth percentage.
Below is a scatter plot of price vs table percentage.
ggplot(data, aes(x=table, y=price))+
geom_point(alpha=0.1)
This looks similar to depth percentage. As the range of table percentage narrows, the price increases, although it doesn’t appear to be as drastic as depth percentage.
Below is a scatter plot of price vs length (x).
ggplot(data, aes(x=x, y=price))+
geom_point(alpha=0.1)
This looks similar to carat. Price increases at an increasing rate as length increases. There are also some data points that have a length of zero. These must be missing data points and should be converted to NAs; if they are not, they can mess up some predictive models down the line. First, let’s find out how many zeros there are.
df <- data %>% # Create a new object df from the object data
filter(x==0) # show only the data that has a zero value for x
print(df) # Display the object df
## # A tibble: 8 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.07 Ideal F SI2 61.6 56 4954 0 6.62 0
## 2 1 Very Good H VS2 63.3 53 5139 0 0 0
## 3 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 4 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 5 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 6 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 7 0.71 Good F SI2 64.1 60 2130 0 0 0
## 8 0.71 Good F SI2 64.1 60 2130 0 0 0
There are 8 rows of data with x recorded as zero, out of 53,940 rows total. This is a tiny fraction (8/53,940, about 0.015%) of the data, but it should still be addressed. These values will be converted into NAs, and what to do with them will be determined during model exploration. These rows could be removed from the data set completely, a median length could be assumed, or x could potentially be calculated from depth and table percentage. Is it even worth addressing at all? This will be determined later in this project.
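For reference, that fraction can be computed directly:

```r
# Share of rows where x is recorded as zero
sum(ggplot2::diamonds$x == 0) / nrow(ggplot2::diamonds)  # 8 / 53940, about 0.015%
```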
I also see zeros for y and z. The script below determines how many zeros there are for width (y):
df <- data %>%
filter(y==0)
nrow(df) # display how many rows are in object df
## [1] 7
And for depth (z):
df <- data %>%
filter(z==0)
nrow(df)
## [1] 20
There are 20 missing data points for z. That is about 0.037% (20/53,940) of the data, slightly more than x and y, but still a small amount. The script below converts all zeros for x, y, and z into NAs, specifically numeric NAs.
data <- data %>% # create a new object called data, from the old object called data
mutate( # change the columns in the old object
x = case_when( # Change the column x by the following conditions
x==0 ~ NA_real_, # When x is 0, make it a NA
TRUE ~ x), # For everything else, keep x as is
y = case_when( # same logic as x
y==0 ~ NA_real_,
TRUE ~ y),
z = case_when( # same logic as x
z==0 ~ NA_real_,
TRUE ~ z))
All zeros are now NAs.
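As an aside, dplyr’s na_if() combined with across() gives a more compact way to do the same conversion; a sketch:

```r
library(dplyr)

# Convert zeros in x, y, and z to NA in a single mutate() call
d <- ggplot2::diamonds %>%
  mutate(across(c(x, y, z), ~ na_if(.x, 0)))

colSums(is.na(d[, c("x", "y", "z")]))  # 8 NAs in x, 7 in y, 20 in z
```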
Below is a scatter plot of price vs width (y).
ggplot(data, aes(x=y, y=price))+
geom_point(alpha=0.1)
First off, there are some outliers past a y of 20. Let’s see what these points are.
df <- data %>%
filter(y>20)
print(df)
## # A tibble: 2 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 2 Premium H SI2 58.9 57 12210 8.09 58.9 8.06
## 2 0.51 Ideal E VS1 61.8 55 2075 5.15 31.8 5.12
These values for y can’t be right. One way to double-check is to use the equation for depth percentage: depth = 2 * z/(x + y). Using a little algebra we can solve for y: y = (2z/depth) - x. Before using these equations, we need a proof of concept: depth and y will be calculated using these equations, and we’ll check whether they equal the depth and y provided in the data. If they do, the equations hold.
df <- data %>%
select(depth, table, x, y, z) %>% # Select only these columns from data
mutate(
depth_calc = round(((2*z)/(x+y))*100, # create a new column, depth_calc using the equation
digits = 1), # round to the nearest one digit
y_calc = round((2*z/(depth/100))-x, # create a new column, y_calc using the equation
digits = 2), # round to the nearest two digits
depth_check = depth==depth_calc, # create depth_check, which is true if depth equals depth_calc
y_check = y==y_calc) %>% # create y_check, which is true if y equals y_calc
select(depth, depth_calc, depth_check, y, y_calc, y_check) # Select only these columns
head(df, n = 20) # show the first 20 rows
## # A tibble: 20 x 6
## depth depth_calc depth_check y y_calc y_check
## <dbl> <dbl> <lgl> <dbl> <dbl> <lgl>
## 1 61.5 61.3 FALSE 3.98 3.95 FALSE
## 2 59.8 59.8 TRUE 3.84 3.84 TRUE
## 3 56.9 56.9 TRUE 4.07 4.07 TRUE
## 4 62.4 62.4 TRUE 4.23 4.23 TRUE
## 5 63.3 63.3 TRUE 4.35 4.35 TRUE
## 6 62.8 62.8 TRUE 3.96 3.96 TRUE
## 7 62.3 62.3 TRUE 3.98 3.98 TRUE
## 8 61.9 61.9 TRUE 4.11 4.1 FALSE
## 9 65.1 65.1 TRUE 3.78 3.78 TRUE
## 10 59.4 59.4 TRUE 4.05 4.05 TRUE
## 11 64 64 TRUE 4.28 4.28 TRUE
## 12 62.8 62.8 TRUE 3.9 3.9 TRUE
## 13 60.4 60.4 TRUE 3.84 3.84 TRUE
## 14 62.2 62.2 TRUE 4.37 4.36 FALSE
## 15 60.2 60.2 TRUE 3.75 3.75 TRUE
## 16 60.9 60.9 TRUE 4.42 4.42 TRUE
## 17 62 62 TRUE 4.34 4.34 TRUE
## 18 63.4 63.4 TRUE 4.29 4.29 TRUE
## 19 63.8 63.8 TRUE 4.26 4.27 FALSE
## 20 62.7 62.7 TRUE 4.27 4.27 TRUE
It looks like the equations work for the most part. In some instances they don’t match exactly, but they are in the ballpark; this is likely due to rounding. Now that the equations are confirmed, let’s calculate y for the two outliers.
df <- data %>%
filter(y>20) %>%
mutate(y_calc = round((2*z/(depth/100))-x,
digits = 2))
print(df)
## # A tibble: 2 x 11
## carat cut color clarity depth table price x y z y_calc
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2 Premium H SI2 58.9 57 12210 8.09 58.9 8.06 19.3
## 2 0.51 Ideal E VS1 61.8 55 2075 5.15 31.8 5.12 11.4
These values seem a lot more reasonable. Now let’s commit this change to the data and re-plot.
data <- data %>%
mutate(
y = case_when(
y>20 ~ round((2*z/(depth/100))-x,digits = 2),
TRUE ~ y))
ggplot(data, aes(x=y, y=price))+
geom_point(alpha=0.1)
Those data points still look like they may be outliers, but they are more reasonable than before, and the trend is easier to see in this chart. Price increases at an increasing rate as y increases.
Below is a scatter plot of price vs depth (z).
ggplot(data, aes(x=z, y=price))+
geom_point(alpha=0.1)
Looks like there is an outlier past a depth (z) of 30. This may be a similar case as seen with width (y). Below shows this data point:
df <- data %>%
filter(z>30)
print(df)
## # A tibble: 1 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.51 Very Good E VS1 61.8 54.7 1970 5.12 5.15 31.8
This doesn’t pass the common sense test. Using the depth percentage equation, we can calculate depth (z) with: z = depth*(x+y)/2. The results are below:
df <- df %>%
mutate(
z_calc = round(((depth/100)*(x+y))/2, digits = 2)
)
print(df)
## # A tibble: 1 x 11
## carat cut color clarity depth table price x y z z_calc
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.51 Very Good E VS1 61.8 54.7 1970 5.12 5.15 31.8 3.17
The calculated z looks more reasonable. The script below commits this change to the data and re-plots.
data <- data %>%
mutate(
z = case_when(
z>30 ~ round(((depth/100)*(x+y))/2, digits = 2),
TRUE ~ z
)
)
ggplot(data, aes(x=z, y=price))+
geom_point(alpha=0.1)
That’s better! This trend appears to be the same as carat, x, and y. Now, onto some more detailed data exploration.
Best of the Best: Imagine someone wants to buy the best possible diamond, and money is no object. They only want to consider diamonds in the top categories of cut (Ideal), color (D), and clarity (IF). They want the most ideal range for depth (59-63) and table (54-57). Within the dataset, if we plot carat versus price can we fit a clean trendline? Is it linear? Exponential? What’s the price of the largest carat, and is it the most expensive?
First we need to filter the data set to only the best cut, color, clarity, depth, and table.
df <- data %>%
filter(
cut=="Ideal",
color=="D",
clarity=="IF",
between(depth, 59, 63),
between(table, 54, 57))
print(df)
## # A tibble: 24 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.51 Ideal D IF 62 56 3446 5.14 5.18 3.2
## 2 0.51 Ideal D IF 62.1 55 3446 5.12 5.13 3.19
## 3 0.53 Ideal D IF 61.5 54 3517 5.27 5.21 3.22
## 4 0.53 Ideal D IF 62.2 55 3812 5.17 5.19 3.22
## 5 0.59 Ideal D IF 60.9 57 4208 5.4 5.43 3.3
## 6 0.56 Ideal D IF 62.4 56 4216 5.24 5.28 3.28
## 7 0.56 Ideal D IF 61.9 57 4293 5.28 5.31 3.28
## 8 0.63 Ideal D IF 62.5 55 6549 5.47 5.5 3.43
## 9 0.63 Ideal D IF 62.5 55 6607 5.5 5.47 3.43
## 10 1.04 Ideal D IF 61.8 57 14494 6.49 6.52 4.02
## # ... with 14 more rows
This filters the data to only 24 options. Let’s plot carat versus price.
ggplot(df, aes(x=carat, y=price)) +
geom_point()
Looks like a linear model could be viable. Let’s create one.
model_df_lm <- lm(price ~ carat, df)
summary(model_df_lm)
##
## Call:
## lm(formula = price ~ carat, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2123.5 -1264.7 245.6 1026.5 2344.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5621.9 623.1 -9.023 7.57e-09 ***
## carat 20259.9 911.8 22.221 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1288 on 22 degrees of freedom
## Multiple R-squared: 0.9573, Adjusted R-squared: 0.9554
## F-statistic: 493.8 on 1 and 22 DF, p-value: < 2.2e-16
Carat passes the t-test (p-value) by a mile, and the R-squared isn’t too bad. Let’s plot this model now.
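The original question also asked whether the trend might be exponential rather than linear. As a rough sketch, we can compare the fit above against a log-linear model on the same filtered data (a log-linear fit in price corresponds to exponential growth with carat):

```r
library(dplyr)

# Rebuild the "best of the best" subset (24 diamonds)
df <- ggplot2::diamonds %>%
  filter(cut == "Ideal", color == "D", clarity == "IF",
         between(depth, 59, 63), between(table, 54, 57))

m_lin <- lm(price ~ carat, data = df)        # linear in carat
m_exp <- lm(log(price) ~ carat, data = df)   # exponential in carat

summary(m_lin)$r.squared
summary(m_exp)$r.squared
```

Note that the two R-squared values are computed on different response scales (price versus log(price)), so this is only a rough comparison, not a formal model selection.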
pred_df_lm <- predict(model_df_lm, df)
df <- df %>%
mutate(pred = pred_df_lm)
ggplot(df, aes(x=carat, y=price)) +
geom_point(color="blue") +
geom_line(color="red", aes(y=pred))
Looks like the most expensive diamond might not be the largest one. Let’s find these two data points.
df2 <- df %>%
filter(
carat == max(carat) |
price == max(price)
)
print(df2)
## # A tibble: 2 x 11
## carat cut color clarity depth table price x y z pred
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1.07 Ideal D IF 60.9 54 17042 6.66 6.73 4.08 16056.
## 2 1.03 Ideal D IF 62 56 17590 6.55 6.44 4.03 15246.
Interesting! The most expensive diamond is not the largest diamond, but it has a larger depth and table compared to the largest-carat diamond. This needs a closer look.
Depth and Table Percentages: I found the ideal depth and table values mentioned above online, but let’s explore the dataset a little. If we fix the 4 C’s (carat, cut, color, and clarity), how much do depth and table impact price? If we widen the ranges slightly, can we save a substantial amount?
First let’s create a scatter plot of price vs carat.
ggplot(data, aes(x=carat, y=price)) +
geom_point()
Next, let’s create a scatter plot of price vs depth.
ggplot(data, aes(x=depth, y=price)) +
geom_point()
Now let’s create a scatter plot of price vs table.
ggplot(data, aes(x=table, y=price)) +
geom_point()
A 3D scatter plot of price vs carat vs depth.
plot_ly(data=data, x=~price, y=~carat, z=~depth, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene = list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Depth")))
plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~cut, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene = list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Depth")))
plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~color, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene = list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Depth")))
plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~clarity, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene = list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Depth")))
A 3D scatter plot of price vs carat vs table.
plot_ly(data=data, x=~price, y=~carat, z=~table, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene = list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))
plot_ly(data=data, x=~price, y=~carat, z=~table, color=~cut, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene = list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))
plot_ly(data=data, x=~price, y=~carat, z=~table, color=~color, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene = list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))
plot_ly(data=data, x=~price, y=~carat, z=~table, color=~clarity, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene = list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))
Best Bang for the Buck: Imagine someone wants to find the diamond which maximizes cut, color, and clarity per dollar. Using the expanded depth and table values from question 2 above, when does price start to increase exponentially for cut? What about for color? And clarity?
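While this section is still under development, one hypothetical first step (an assumption on my part, not the final analysis) might be to look at the median price at each level of a category, to see roughly where price starts to jump:

```r
library(dplyr)

# Median price at each cut level, as a first look at where price jumps
tab <- ggplot2::diamonds %>%
  group_by(cut) %>%
  summarise(median_price = median(price), .groups = "drop")
print(tab)
```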
Under development.
Under development.
Under development.
Under development.
Under development.
Throughout this analysis, some assumptions and data cleaning steps have been made. These decisions will impact how each predictive model performs. A sensitivity analysis is recommended for the following data cleaning issues:
1. Issue: x, y, and z values recorded as zeros.
Action taken in this analysis: Convert all zeros to NAs; how to handle the NAs will be decided during model exploration.
2. Issue: Potential outliers for y and z.
Action taken in this analysis: Estimate y and z using the depth equation.
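As a hypothetical illustration of such a sensitivity analysis (the variable choices here are my assumptions, not the final models), one could compare a simple fit with the NA rows dropped versus median-imputed:

```r
library(dplyr)

# Recreate the zero-to-NA cleaning for x
d <- ggplot2::diamonds %>%
  mutate(x = na_if(x, 0))

# (a) lm() drops rows with NAs by default
m_drop <- lm(price ~ carat + x, data = d)

# (b) impute the median x instead
m_impute <- lm(price ~ carat + x,
               data = mutate(d, x = coalesce(x, median(x, na.rm = TRUE))))

# Compare the x coefficient under each choice
coef(m_drop)["x"]
coef(m_impute)["x"]
```

If the coefficients barely move, the handling of these 8 rows is unlikely to matter; if they shift noticeably, the choice deserves more attention during model exploration.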